@relaxis commented Oct 22, 2025

Summary

This PR includes critical bug fixes for WAN 2.2 I2V/T2V training and improvements for video training workflows.


Fixes Included

1. MoE Per-Expert LR Logging Fix ✨ NEW

Problem: The logged LR was averaged across all param groups for MoE models, making it impossible to verify per-expert LR adaptation and state preservation.

Solution:

  • Detect MoE via multiple param groups (BaseSDTrainProcess.py)
  • Display separate LR for each expert: lr0: 5.0e-04 lr1: 3.5e-05
  • Shows which expert is training and tracks independent LR adaptation

Files changed: jobs/process/BaseSDTrainProcess.py

Example output:

```
# Before (meaningless average)
lr: 2.7e-04 loss: 8.414e-02

# After (clear per-expert visibility)
lr0: 2.8e-05 lr1: 0.0e+00 loss: 8.414e-02  # High Noise active
lr0: 5.2e-05 lr1: 1.0e-05 loss: 7.821e-02  # Low Noise now active, High preserved
```
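The display logic can be sketched as follows. This is a minimal illustration, not the actual code in BaseSDTrainProcess.py; `param_groups` stands in for the optimizer's group list, and the formatting function name is hypothetical:

```python
def format_lr_display(param_groups):
    """Show one LR per param group when MoE is detected (>1 group);
    otherwise fall back to the single-LR display."""
    if len(param_groups) > 1:
        # MoE: one entry per expert, e.g. "lr0: 5.0e-04 lr1: 3.5e-05"
        return " ".join(
            f"lr{i}: {g['lr']:.1e}" for i, g in enumerate(param_groups)
        )
    return f"lr: {param_groups[0]['lr']:.1e}"

groups = [{"lr": 5.0e-4}, {"lr": 3.5e-5}]
print(format_lr_display(groups))  # lr0: 5.0e-04 lr1: 3.5e-05
```

With a single param group the old `lr:` format is preserved, so non-MoE models log exactly as before.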

2. MoE Transformer Detection Bug Fix ✨ NEW

Problem: _prepare_moe_optimizer_params() checked for .transformer_1. (with dots), but lora_name uses $$ separators, so the check never matched. All params fell into a single group instead of separate groups per expert.

Solution:

  • Fixed substring matching to use transformer_1 without dots
  • Now correctly matches names like transformer$$transformer_1$$blocks$$0$$attn1$$to_q
  • Creates proper separate param groups for transformer_1 and transformer_2
  • Enables per-expert lr_bump, min_lr, max_lr with automagic optimizer

Files changed: toolkit/lora_special.py
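The failure mode is easy to demonstrate. The routing function below is a sketch (its name and group numbering are illustrative), but the `$$`-separated name is the format the PR describes:

```python
def expert_group_for(lora_name: str) -> int:
    """Route a LoRA module to its expert's param group. Names use
    '$$' separators, so the old dotted '.transformer_1.' check
    could never match."""
    if "transformer_1" in lora_name:
        return 0  # high-noise expert
    if "transformer_2" in lora_name:
        return 1  # low-noise expert
    return 0

name = "transformer$$transformer_1$$blocks$$0$$attn1$$to_q"
assert ".transformer_1." not in name  # old check: never true
assert "transformer_1" in name        # fixed check: matches
```

Dropping the surrounding dots makes the substring check separator-agnostic, which is why the fix works for the `$$`-joined names.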

3. WAN 2.2 I2V Boundary Detection Fix

Problem: The toolkit was hardcoded to use T2V boundary ratio (0.875) for all WAN 2.2 models, causing incorrect timestep distribution for I2V models.

Solution:

  • Auto-detect I2V vs T2V models from model path
  • Use correct boundary ratio: 0.9 for I2V, 0.875 for T2V
  • Fixes dual LoRA (HIGH/LOW noise) training for I2V models

Files changed: extensions_built_in/diffusion_models/wan22/wan22_14b_model.py
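The detection amounts to a path check plus the two ratios. A minimal sketch, assuming a case-insensitive "i2v" substring test (the real logic lives in wan22_14b_model.py and may key off other metadata):

```python
def wan22_boundary_ratio(model_path: str) -> float:
    """Timestep boundary between HIGH- and LOW-noise experts:
    0.9 for I2V checkpoints, 0.875 for T2V."""
    if "i2v" in model_path.lower():
        return 0.9
    return 0.875

print(wan22_boundary_ratio("Wan2.2-I2V-A14B"))  # 0.9
print(wan22_boundary_ratio("Wan2.2-T2V-A14B"))  # 0.875
```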

4. AdamW8bit OOM Crash Fix

Problem: When an OOM occurs during a training step, the progress-bar update attempts to access loss_dict, which was never populated, causing a KeyError crash.

Solution:

  • Only update progress bar if training step succeeded (not did_oom)
  • Prevents crash and allows training to continue after OOM recovery

Files changed: jobs/process/BaseSDTrainProcess.py
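The shape of the guard, sketched with illustrative names (`update_progress`, a dict standing in for the tqdm postfix; the real code updates the actual progress bar):

```python
def update_progress(progress, loss_dict, did_oom):
    """Only read loss_dict when the step succeeded; after an OOM
    the dict is empty, and indexing it is what raised the KeyError."""
    if did_oom:
        return progress  # skip the update, let training continue
    progress["postfix"] = f"loss: {loss_dict['loss']:.3e}"
    return progress

print(update_progress({}, {"loss": 0.08414}, did_oom=False))
print(update_progress({}, {}, did_oom=True))  # no KeyError
```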

5. Gradient Norm Logging

Problem: No visibility into gradient norms during training, making it difficult to diagnose divergence and LR issues.

Solution:

  • Added _calculate_grad_norm() method with comprehensive gradient tracking
  • Handles sparse gradients and param groups correctly
  • Logs grad_norm in loss_dict alongside loss
  • Essential for monitoring training stability with adaptive optimizers

Files changed: extensions_built_in/sd_trainer/SDTrainer.py
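The core of such a method is a global L2 norm over all param groups, skipping params with no gradient (frozen, or belonging to the inactive expert). A pure-Python sketch, with gradients as plain float lists rather than the tensors the real trainer uses:

```python
import math

def calculate_grad_norm(param_groups):
    """Global L2 gradient norm: sqrt of the summed squared gradient
    entries across every param in every group. Params whose grad is
    None are skipped."""
    total_sq = 0.0
    for group in param_groups:
        for p in group["params"]:
            grad = p.get("grad")
            if grad is None:
                continue
            total_sq += sum(g * g for g in grad)
    return math.sqrt(total_sq)

groups = [{"params": [{"grad": [3.0, 4.0]}, {"grad": None}]}]
print(calculate_grad_norm(groups))  # 5.0
```

A steadily growing value from a method like this is an early divergence signal, which is what makes it useful alongside adaptive optimizers.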


Features Included

1. Video-Friendly Bucket Resolutions ✨ NEW

Problem: Previous SDXL-oriented buckets caused excessive cropping for video content with common aspect ratios.

Solution:

  • New resolutions_video_1024 with video aspect ratios (16:9, 9:16, 4:3, 3:4)
  • Uses primary buckets only, so frames are never assigned to undersized buckets
  • Enabled by default with use_video_buckets: true

Benefits:

  • Better aspect ratio preservation
  • Reduced unnecessary cropping
  • Improved training quality for video datasets
  • Backwards compatible (can disable with use_video_buckets: false)

Files changed: toolkit/buckets.py, toolkit/data_loader.py, toolkit/dataloader_mixins.py, toolkit/config_modules.py
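Bucket assignment by nearest aspect ratio can be sketched as below. The exact resolutions_video_1024 list lives in toolkit/buckets.py; the 64-aligned dimensions here are illustrative examples for the named ratios:

```python
# Illustrative video-friendly buckets (width, height)
RESOLUTIONS_VIDEO_1024 = [
    (1024, 576),   # 16:9
    (576, 1024),   # 9:16
    (1024, 768),   # 4:3
    (768, 1024),   # 3:4
    (1024, 1024),  # 1:1
]

def nearest_bucket(width, height, buckets=RESOLUTIONS_VIDEO_1024):
    """Pick the bucket whose aspect ratio is closest to the frame's,
    minimizing the crop needed to fit."""
    ratio = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))

print(nearest_bucket(1920, 1080))  # (1024, 576)
```

A 1920x1080 clip lands in the 16:9 bucket with zero cropping, where an SDXL-oriented bucket list would force a near-square crop.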

2. Pixel Budget Scaling ✨ NEW

Problem: Different aspect ratios used inconsistent resolutions, causing variable memory usage and suboptimal quality.

Solution:

  • New max_pixels_per_frame parameter for memory-based scaling
  • Each aspect ratio is maximized within the pixel budget
  • Example: max_pixels_per_frame: 589824 (768×768) optimally scales all ratios

Benefits:

  • Consistent memory usage across aspect ratios
  • Maximizes resolution for each ratio within memory constraints
  • Better quality without memory surprises
  • Only activates when max_pixels_per_frame is set
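The scaling itself is a uniform resize to the budget, preserving aspect ratio. A minimal sketch; the 16-pixel alignment is an assumption (the real toolkit's latent stride may differ):

```python
import math

def scale_to_pixel_budget(width, height, max_pixels_per_frame, align=16):
    """Scale (width, height) to fill the pixel budget while keeping
    aspect ratio; dims are snapped to a multiple of `align`."""
    scale = math.sqrt(max_pixels_per_frame / (width * height))
    w = int(round(width * scale)) // align * align
    h = int(round(height * scale)) // align * align
    return w, h

# 589824 = 768 * 768; a 16:9 source fills the same budget at 1024x576
print(scale_to_pixel_budget(1920, 1080, 589824))  # (1024, 576)
```

This is why one budget value gives consistent memory use: 1024x576, 768x768, and 576x1024 all occupy roughly the same number of pixels per frame.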

Feature Requests

UI/Config Enhancements

1. Automagic Optimizer Support

Request UI fields and validation for the automagic optimizer:

  • min_lr, max_lr, lr_bump, starting lr

Benefit: Automagic is highly effective for WAN 2.2 training but currently requires manual YAML editing.

2. Network Dropout Settings

Add UI field for network.dropout parameter.

Benefit: Dropout helps prevent overfitting in LoRA training, especially important for small datasets.

3. More Custom Resolutions

Add more resolution presets: 256x256, 320x320, 384x384, 448x448, 512x512

Benefit: Different resolutions have different training characteristics.

4. Training Metrics & Graph Plotting

Add built-in metric tracking and visualization:

  • Per-LoRA loss tracking
  • Gradient norm over time
  • Learning rate progression
  • Optional TensorBoard export

Benefit: Currently users must manually parse logs and create graphs.

VRAM Optimization Requests

5. Single LoRA Training Mode for WAN 2.2

Add options to load only HIGH or only LOW noise model.

Benefit: Saves ~7-10GB VRAM by not loading the unused transformer.

6. Fix RAMTorch Implementation for WAN 2.2

RAMTorch currently does not work properly with the WAN 2.2 dual-transformer architecture.

Benefit: Would enable training on lower VRAM GPUs.

7. PyTorch Nightly + CUDA 13 Support (Blackwell)

Add optional requirements for PyTorch nightly, CUDA 13.x, SM_120.

Benefit: Enables RTX 50-series GPU users to utilize new optimizations.


Testing

All fixes and features have been tested in production WAN 2.2 I2V LoRA training:

  • 59-video dataset with mixed aspect ratios
  • 6000+ steps
  • Automagic optimizer with per-expert parameters
  • Dual LoRA (HIGH/LOW noise) training
  • MoE switching every 100 steps

Results:

  • ✅ Per-expert LR display working correctly
  • ✅ LR state preservation verified at each expert switch
  • ✅ Video buckets properly preserve aspect ratios
  • ✅ Pixel budget scaling maintains consistent memory usage
  • ✅ No OOM crashes
  • ✅ Gradient norm logging provides excellent training visibility

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
